And Non-Parametric Equivalents
For cases where observations can be classified into one of two groups.
\[ P(x=K|N,p) = \frac{N!}{ K! (N-K)!}p^K(1-p)^{N-K} \]
Used as the generative function for Bayesian Inferences.
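As a quick sanity check on the formula above, here is a minimal sketch comparing a hand-coded version of the binomial mass function against R's built-in dbinom() (the values 37, 50, and 0.74 are used purely for illustration):

```r
# The binomial probability mass function by hand, P(x = K | N, p)
binom_pmf <- function(K, N, p) {
  factorial(N) / (factorial(K) * factorial(N - K)) * p^K * (1 - p)^(N - K)
}

# R's built-in dbinom() computes the same quantity
binom_pmf(37, 50, 0.74)
dbinom(37, 50, 0.74)
```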
Can be used to test whether a set of data is equivalent to an expectation given by the underlying probability of seeing an item in “Group 1” (catfish in this example):
\(H_O: p = \hat{p}\)
For this, the test statistic is defined as the number of observations in “Group 1” (e.g., \(N_{catfish}\) in this example), and what we know about the probability of selecting items that have two distinct states is used to evaluate the likelihood of getting \(N_{catfish}\).
Borrowing from a previous example, let’s assume we are sampling fish in the James River. The hypothesis is that the frequency of catfish is
\[ p_{catfish} = \frac{N_{catfish}}{N_{catfish} + N_{other}} = \frac{37}{50} \approx 0.74 \]
So if we have another set of samples, we can test the null hypothesis:
\(H_O: p = 0.74\)
Using the new counts of \(N_{catfish}\) and \(N_{other}\).
R has a direct function for this, binom.test(), which allows testing two-tailed or one-sided alternative hypotheses.
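A minimal sketch of the call, testing the null hypothesis from above (the counts here anticipate the sample reported in the output below):

```r
# Test H_O: p = 0.74 given 52 catfish among 99 sampled fish
fit <- binom.test( x = 52, n = 99, p = 0.74, alternative = "two.sided" )
fit$p.value    # small value -> reject H_O
fit$estimate   # observed proportion of successes
```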
So, let’s assume you went out and sampled new data and found:

```
	Exact binomial test

data:  samples
number of successes = 52, number of trials = 99, p-value = 5.007e-06
alternative hypothesis: true probability of success is not equal to 0.74
95 percent confidence interval:
 0.4223973 0.6265570
sample estimates:
probability of success 
             0.5252525 
```
Conclusion?
As usual, the object returned by the function has all the components you need to report the results in a manuscript.
When we have more than two unique states, we can extend the Binomial test to a more generalized Multinomial expansion. This would be used to test the hypothesis about specific probabilities for \(K-1\) states (the number of independently assorting groups). However, this is a specific version of a more generalized approach using Contingency Tables and I won’t be going over it here.
However, there are a few libraries in R available if you’d prefer to use the multinomial expansion.
Contingency tables are another counting approach, where your data can be classified into one or more categories. Often it is taken as a comparison between populations. So, in keeping with our fish theme, let’s assume we went to another river and did another sampling run on catfish.
This is defined as an \(R \times C\) Contingency Table, where \(R\) is the number of Rows and \(C\) is the number of Columns.
For this, we are assuming that the observations in the row are taken from a larger background population.
The same applies for samples from Population 2.
\(H_O: p_{catfish, \; pop1} = p_{catfish, \; pop2}\)
If we had more than just two Groups, we expand these to the vector of probabilities by row.
\(H_O: \vec{p}_i = \vec{p}_j\)
And the same applies for more than just two rows.
The test statistic for this is based upon:
The distance between the observed and expectations for the data.
Observed
The count of items in each combination of categories. - \(O_{11}\), \(O_{21}\), etc.
Expected
Based on number of categories OR external information.
Let’s assume I went out and sampled 50 more fishes and my overall data are now:
If we have no data, and the underlying hypothesis is that samples may be put into any of the Groups randomly (e.g., the underlying assumption that all fish species have equal abundance), then the expectation is that the frequencies of observations would be \(\frac{1}{K}\), where \(K\) is the number of groups.
In this example, we have a prior expectation that:
\[ p_{catfish} = 0.74 \]
\[ p_{other} = 1.0 - 0.74 = 0.26 \]
The test statistic should measure how close the data in these two matrices are.
This is based upon an approximation to the \(\chi^2\) distribution, which is well behaved and described for counting data.
\[ T = \sum_{i=1}^r\sum_{j=1}^c\frac{(O_{ij} - E_{ij})^2}{E_{ij}} \]
I’m going to show you how to estimate this from the raw count data so we can get an example where you have to look up the probability for an underlying distribution. There is a chisq.test() function, but you have to be very careful that you have it set up properly for contingency table analyses.
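A minimal sketch of estimating \(T\) by hand, using hypothetical observed counts with expected counts derived from the prior probabilities \(p_{catfish} = 0.74\) and \(p_{other} = 0.26\) (the counts here are made up for illustration and will not reproduce the test statistic discussed below):

```r
# Hypothetical observed counts: rows = populations, cols = species
O <- matrix( c( 37, 13,
                26, 24 ), nrow = 2, byrow = TRUE,
             dimnames = list( c("Population 1", "Population 2"),
                              c("Catfish", "Other") ) )

# Expected counts under H_O: each row follows p = (0.74, 0.26)
E <- rowSums(O) %o% c(0.74, 0.26)

# The test statistic T = sum over all cells of (O - E)^2 / E
T.stat <- sum( (O - E)^2 / E )
T.stat
```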
The general contingency table approach is based upon \(df = (r-1)(c-1)\) independent elements and is distributed as a \(\chi^2\) random variable. In a classic ‘hypothesis testing’ framework with \(\alpha = 0.05\), we would estimate the critical value of the test statistic as:
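For example, with a \(2 \times 2\) table, \(df = (2-1)(2-1) = 1\), and the critical value comes from the quantile function qchisq():

```r
# Critical value of the chi-square statistic for alpha = 0.05, df = 1
qchisq( 0.95, df = 1 )   # ~3.84
```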
To get the actual probability associated with \(T = 26.33\), we pass the test statistic to the cumulative distribution function for the \(\chi^2\) distribution and take the upper tail.
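In R, the upper-tail probability can be computed with pchisq(); here with 1 degree of freedom as an assumed example (the df depends on your table's dimensions):

```r
# Probability of a chi-square value at least this large under H_O
pchisq( 26.33, df = 1, lower.tail = FALSE )
```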
```
  Sample Species         Site Measurement
1      1 Catfish Population 1    8.205309
2      2 Catfish Population 1   11.497717
3      3 Catfish Population 1   13.045837
4      4 Catfish Population 1   11.666778
5      5 Catfish Population 1   12.075094
6      6 Catfish Population 1   11.045994
```
The table() function can take the columns of factors and provide for you the observed matrix of counts. If you name your factors descriptively, it will take care of the row and column headers for you.
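A minimal sketch with made-up factor data (the six fish below are hypothetical):

```r
# Hypothetical species and site factors for six sampled fish
fish <- data.frame( Species = c("Catfish", "Other", "Catfish",
                                "Catfish", "Other", "Catfish"),
                    Site    = c("Population 1", "Population 1", "Population 1",
                                "Population 2", "Population 2", "Population 2") )

# table() builds the observed count matrix, with named rows and columns
O <- table( fish$Species, fish$Site )
O
```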
There are equivalent tests (ok, not strictly equivalent, but they can be used to answer a similar question) that do not have assumptions of normality, homoscedasticity, etc.
Important
These are largely based upon Ranks and Combinatory Theory.
You gain the applicability with a potential loss of information (and hence statistical power).
The normal cor.test() function has an option to swap out the Pearson-Product Moment approach for one based on Ranks.
I’m going to use the Iris data set to look at a correlation for relatively poorly behaved data.
```r
fit_pearson  <- cor.test( iris$Sepal.Length, iris$Sepal.Width )
fit_kendall  <- cor.test( iris$Sepal.Length, iris$Sepal.Width, method = "kendall" )
fit_spearman <- cor.test( iris$Sepal.Length, iris$Sepal.Width, method = "spearman" )
data.frame( Model = c("Pearson", "Kendall", "Spearman"),
            Parameter = c("r", "tau", "rho"),
            R = c(fit_pearson$estimate, fit_kendall$estimate, fit_spearman$estimate),
            P = c(fit_pearson$p.value, fit_kendall$p.value, fit_spearman$p.value) ) -> df
```

Here is the difference in the models (the ties impacted the Spearman p-value estimate).
There are a few different methods for estimating linear regression models via non-parametric approaches. We’ve seen one common approach whenever we used the stat_smooth() function in ggplot(). By default, it uses the loess() local smoother approach.
In a lm(), the goal is to use least cost approaches to minimize the error variance across all the data simultaneously.
The loess() approach looks at a local window for estimating the formula of the line, along with a moderately high-dimensional polynomial.
\[ \hat{y}_i = \hat{\beta}_0^{(x)} + \hat{\beta}_1^{(x)} x_i \]
The function has a signature similar to that we saw in lm() for normal regression.
Notice the optional value for span, which defines the size of the local window.
Let’s look at the consequence of the span parameter.
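A minimal sketch fitting loess() with two different span values on the iris data (the span values here are arbitrary choices for illustration):

```r
# Smaller span -> wigglier local fit; larger span -> smoother fit
fit_narrow <- loess( Sepal.Width ~ Sepal.Length, data = iris, span = 0.3 )
fit_wide   <- loess( Sepal.Width ~ Sepal.Length, data = iris, span = 1.0 )

# The narrower window tracks the data more closely, leaving smaller residuals
sd( residuals(fit_narrow) )
sd( residuals(fit_wide) )
```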
One method suggested to figure out what an optimal value of span would be is based upon a generalized cross validation (GCV) approach.
\[ GCV(s) = \frac{ \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_{i,s})^2 }{ \left(1 - \frac{v_s}{N}\right)^2 } \]
where \(v_s = \sum_{i=1}^N h_{ii,s}\) is the trace of the idempotent Hat Matrix \(\mathbf{H}\).
Here is a function that does this for the general loess model.
```r
loess.gcv <- function(x, y) {
  nobs <- length(y)
  # Order the data by the predictor so the fitted line can be plotted directly
  xs <- sort(x, index.return = TRUE)
  x <- xs$x
  y <- y[xs$ix]
  # Generalized cross-validation score for a candidate span value
  tune.loess <- function(s) {
    lo <- loess(y ~ x, span = s)
    mean((lo$fitted - y)^2) / (1 - lo$trace.hat/nobs)^2
  }
  # Find the span that minimizes the GCV score
  os <- optimize(tune.loess, interval = c(.01, 99))$minimum
  lo <- loess(y ~ x, span = os)
  data.frame(Sepal.Length = x, Sepal.Width = lo$fitted, span = os)
}
```

There are several additional approaches, including Kernel Regression (e.g., Nadaraya & Watson’s Kernel Smoother) and Local Averaging (e.g., Friedman’s Super Smoother). If you need a non-parametric regression approach, please explore the benefits and challenges of these alternative methods before deciding which one to use.
This is an extension of the t-test for the equivalence of the mean values of two data sets (e.g., \(H_O: \bar{x} = \bar{y}\)).
This approach has the following assumptions:
To test this, we use the Wilcoxon Rank Sum test. The signature of this test is:
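A minimal sketch of the call, where x and y are the two numeric samples (the measurements below are made up for illustration):

```r
# Hypothetical measurements from two groups
x <- c( 8.2, 11.5, 13.0, 11.7, 12.1, 11.0 )
y <- c( 9.5, 10.2,  9.9, 10.8,  9.1, 10.4 )

# Rank-based test of H_O: the two samples come from the same distribution
fit_w <- wilcox.test( x, y, alternative = "two.sided" )
fit_w
```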
This will take the full set of data, assign them ranked values, and then determine if the ranked values are randomly distributed between the two groups.
As for other analyses, the object returned has a number of elements that may be helpful in writing about the analysis.
The Kruskal-Wallis test is an extension of the Wilcoxon test in the same way that aov() is an extension of t.test(). The approach is again based upon ranks. Here is an example data set we’ve used once before, based on air quality measured across five different months.
To test for the equality of ozone across months, we use kruskal.test()
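A minimal sketch using the built-in airquality data, which records Ozone across the months May through September:

```r
# Rank-based test of H_O: Ozone has the same distribution in every Month
fit <- kruskal.test( Ozone ~ Month, data = airquality )
fit
```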
As for other analyses, the object returned has a number of elements that may be helpful in writing about the analysis.